Abstract

Large-scale platforms currently experience errors from two different sources, namely fail-stop 
errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt 
data). This work combines checkpointing and replication for the reliable execution of linear 
workflows on platforms subject to these two error types. While checkpointing and replication 
have been studied separately, their combination has not yet been investigated despite its promising
potential to minimize the execution time of linear workflows in error-prone environments.
Moreover, combined checkpointing and replication has not yet been studied in the presence of 
both fail-stop and silent errors. The combination raises new problems: for each task, we have 
to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide 
an optimal dynamic programming algorithm of quadratic complexity to solve both problems. 
This dynamic programming algorithm has been validated through extensive simulations that 
reveal the conditions in which checkpointing only, replication only, or the combination of both
techniques leads to improved performance.

Keywords: checkpoint, replication, HPC, fail-stop error, silent error, linear workflow 
1 Introduction

Several high-performance computing (HPC) applications are designed as a succession of (typically 
large) tightly-coupled computational kernels, or tasks, that should be executed in sequence [9, 15, 26]. 
These parallel tasks are executed on the entire platform, and they exchange data at the end of their 
execution. In other words, the task graph is a linear chain, and each task (except maybe the first 
one and the last one) reads data from its predecessor and produces data for its successor. Such 
linear chains of tasks also appear in image processing applications [30], and are usually called linear 
workflows [41]. 

The first objective when considering linear workflows is to ensure their efficient execution, which 
amounts to minimizing the total parallel execution time, or makespan. However, a reliable execution 
is also critical to performance. Indeed, large-scale platforms are increasingly subject to errors [10, 11]. 
Scale is the enemy here: even if each computing resource is very reliable, with, say, a Mean Time 
Between Errors (MTBE) of ten years, meaning that each resource will experience an error only every 
10 years on average, a platform composed of 100,000 such resources will experience an error every
fifty minutes [25]. Hence, fault-tolerance techniques to mitigate the impact of errors are required 
to ensure a correct and uninterrupted execution of the application [28]. To further complicate 
matters, several types of errors need to be considered when computing at scale. In addition to the 
classical fail-stop errors (such as hardware failures or crashes), silent errors (also known as silent data 
corruptions) constitute another threat that can no longer be ignored [33, 53, 51, 52, 31]. There are 
several causes of silent errors, including cosmic radiation and packaging pollution, among others. Silent
errors can strike the cache and memory (bit flips) components as well as the CPU operations; in the 
latter case they resemble floating-point errors due to improper rounding, but have a dramatically 
larger impact because any bit of the result, not only low-order mantissa bits, can be corrupted. 
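
The scale argument above can be checked with a few lines of arithmetic. The sketch below assumes independent resources whose error rates simply add up; the function name and the 365-day year are illustrative choices, not taken from the cited sources.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year


def platform_mtbe_minutes(individual_mtbe_years: float, num_resources: int) -> float:
    """MTBE of the whole platform, in minutes.

    Assumes independent resources, so the platform error rate is the sum
    of the individual rates and the platform MTBE is the individual MTBE
    divided by the number of resources.
    """
    individual_mtbe_minutes = individual_mtbe_years * HOURS_PER_YEAR * 60
    return individual_mtbe_minutes / num_resources


if __name__ == "__main__":
    # Ten-year individual MTBE, 100,000 resources: a bit under an hour.
    print(platform_mtbe_minutes(10, 100_000))
```

With these inputs the platform MTBE comes out slightly above fifty minutes, in line with the figure quoted in the text.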

The standard approach to cope with fail-stop errors is checkpoint with rollback and recovery [13, 
19]: in the context of linear workflow applications, each task can decide to take a checkpoint after it 
has correctly executed. A checkpoint is simply a file including all intermediate results and associated 
data that is saved on a storage medium resilient to errors; it can be either the memory of another 
processor, a local disk, or a remote disk. This file can be recovered if a successor task experiences an 
error later in the execution. If there is an error while some task is executing, the application has to 
roll back to the last checkpointed task (or to start recomputing again from scratch if no checkpoint 
was taken). Then the checkpoint is read from the storage medium (recovery phase), and execution 
resumes from that task onward. If the checkpoint was taken many tasks before an error strikes, there 
is a lot of re-execution involved, which calls for more frequent checkpoints. However, checkpointing 
incurs a significant overhead, and is a mere waste of resources if no error strikes. Altogether, there 
is a trade-off to be found, and one may want to checkpoint only carefully selected tasks. 

While checkpoint/restart [13, 19, 20] is the de-facto recovery technique for addressing fail-stop 
errors, there is no widely adopted general-purpose technique to cope with silent errors. The challenge 
with silent errors is detection latency: contrary to a fail-stop error whose detection is immediate, a
silent error is identified only when the corrupted data is activated and/or leads to an unusual application
behavior. However, checkpoint and rollback recovery assumes instantaneous error detection,
and this raises a new difficulty: if the error struck before the last checkpoint, and is detected after
that checkpoint, then the checkpoint is corrupted and cannot be used to restore the application. To
address the problem of silent errors, many application-specific detectors, or verification mechanisms, 
have been proposed. We apply such a verification mechanism after each task in this paper. Our 
approach is agnostic of the nature of the verification mechanism (checksum, error correcting code, 
coherence test, etc.). In this context, if the verification succeeds, then the output of the task is 
correct, and one can safely either proceed to the next task directly, or save the result beforehand by 
taking a checkpoint. Otherwise, if the verification fails, we have to roll back to the last saved checkpoint
and re-execute the work from that point on. However, and contrary to fail-stop errors, silent errors
do not cause the loss of the entire memory content of the affected processor. To account for this 
difference, we use a two-level checkpointing scheme: the checkpoint file is saved in the main memory 
of the processor before being transferred to some storage (disk) that is resilient to fail-stop errors. 
This allows for recovering faster after a silent error than after a fail-stop error. 

Replication is a well-known, but costly, method to deal with both fail-stop errors [22, 23, 12,
36, 49, 21, 18, 38] and silent errors [32, 3]. While both checkpointing and replication have been 
extensively studied separately, their combination has not yet been investigated in the context of 
linear workflows, despite its promising potential to minimize the execution time in error-prone 
environments. The contributions of this work are the following: 

• We provide a detailed model for the reliable execution of linear workflows, where each task can 
be replicated or not, and with a two-level checkpoint/recovery mechanism whose cost depends 
both on the number of processors executing the task, and on whether the task is replicated or 
not. 

• We address both fail-stop and silent errors. We perform a verification after each task to detect 
silent errors and recover from the last in-memory checkpoint after detecting one. We recover 
from the last disk checkpoint after a fail-stop error. If a task is replicated, we do not need to 
roll back and we can directly proceed to the next task, unless both replicas have been affected 
(by either error type). 

• We design an optimal dynamic programming algorithm that minimizes the makespan of a 
linear workflow with n tasks, with a quadratic complexity, in the presence of fail-stop and 
silent errors. 

• We conduct extensive experiments to evaluate the impact of using both replication and checkpointing
during execution, and compare them to an execution without replication.

• We provide guidelines about when it is beneficial to employ checkpointing only, replication 
only, or to combine both techniques together. 

The paper is organized as follows. Section 2 details the model and formalizes the objective 
function and the optimization problem. Section 3 presents a preliminary result for the dynamic 
programming algorithm: we explain how to compute the expected time needed to execute a single 
task (replicated or not), assuming that its predecessor has been checkpointed. The proposed optimal 
dynamic programming algorithm is outlined in Section 4. The experimental validation is provided 
in Section 5. Finally, related work is discussed in Section 6, and the work is concluded in Section 7. 


2 Model and objective 

This section details the framework of this study. We start with the application and platform models, 
then we detail the verification, checkpointing and replication, and finally we state the optimization 
problem. 

2.1 Application model 

We target applications whose workflows represent linear chains of parallel tasks. More precisely, for
one application, consider a chain $T_1 \to T_2 \to \cdots \to T_n$ of $n$ parallel tasks $T_i$, $1 \le i \le n$. Hence, $T_1$
must be completed before executing $T_2$, and so on.

Here, each $T_i$ is a parallel task whose speedup profile obeys Amdahl's law [1]: the total work, $w_i$,
consists of a sequential fraction $\alpha_i w_i$, $0 \le \alpha_i \le 1$, and the remaining fraction $(1 - \alpha_i) w_i$ is perfectly
parallel. The (error-free) execution time, $T_i$, using $q$ processors is thus $w_i \left(\alpha_i + \frac{1 - \alpha_i}{q}\right)$. Without
loss of generality, we assume that processors execute the tasks at unit speed, and we use time units 
and work units interchangeably. While our study is agnostic of task granularity, it applies primarily 
to frameworks where tasks represent large computational entities whose execution takes from a 
few minutes up to tens of minutes. In such frameworks, it may be worthwhile to replicate or to 
checkpoint tasks to mitigate the impact of errors. 
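
As a minimal sketch of the Amdahl's law profile above, the function below evaluates the error-free execution time of one task. The function and parameter names are illustrative, and unit-speed processors are assumed as in the text.

```python
def error_free_time(work: float, alpha: float, q: int) -> float:
    """Error-free execution time w * (alpha + (1 - alpha) / q).

    work  : total work w_i of the task (time units at unit speed)
    alpha : sequential fraction alpha_i, between 0 and 1
    q     : number of processors executing the task
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if q < 1:
        raise ValueError("q must be at least 1")
    return work * (alpha + (1.0 - alpha) / q)


if __name__ == "__main__":
    # A task with 10% sequential work on 10 processors.
    print(error_free_time(1000.0, 0.1, 10))
```

Halving the number of processors, as replication does, only stretches the parallel part of the task; the sequential part $\alpha_i w_i$ is unchanged.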

2.2 Execution platform 

We target a homogeneous platform with $p$ processors $P_i$, $1 \le i \le p$. We assume that the platform
is subject to fail-stop and silent errors whose inter-arrival times follow an Exponential distribution.
More precisely, let $\lambda^F_{\text{ind}}$ be the fail-stop error rate of each individual processor $P_i$: the probability
of having a fail-stop error striking $P_i$ within $T$ time-units is $\mathbb{P}(X \le T) = 1 - e^{-\lambda^F_{\text{ind}} T}$. Similarly, let
$\lambda^S_{\text{ind}}$ be the silent error rate of each individual processor $P_i$: the probability of having a silent error
striking $P_i$ within $T$ time-units is $\mathbb{P}(Y \le T) = 1 - e^{-\lambda^S_{\text{ind}} T}$. Then, a computation on $q \le p$ processors
has an error rate $q\lambda^F_{\text{ind}}$ for fail-stop errors, and $q\lambda^S_{\text{ind}}$ for silent errors. The probability of having a
fail-stop error within $T$ time-units and with $q$ processors becomes $1 - e^{-q\lambda^F_{\text{ind}} T}$ (and $1 - e^{-q\lambda^S_{\text{ind}} T}$ for
a silent error) [25].
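
The following sketch evaluates the error probability stated above for either error type. The function name is illustrative; the same expression serves for fail-stop and silent errors, with the corresponding individual rate passed as `rate_ind`.

```python
import math


def error_probability(rate_ind: float, q: int, duration: float) -> float:
    """Probability 1 - exp(-q * rate_ind * duration) that at least one error
    strikes q processors within `duration` time units.

    rate_ind is the individual error rate of one processor (fail-stop or
    silent). math.expm1 keeps precision when the exponent is tiny.
    """
    return -math.expm1(-q * rate_ind * duration)


if __name__ == "__main__":
    # Same individual rate, growing processor counts.
    for q in (1, 100, 10_000):
        print(q, error_probability(1e-7, q, 3600.0))
```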


2.3 Verification 

To detect silent errors, we add a verification mechanism at the end of each task. This ensures that 
the error will be detected as soon as possible. The verification following task $T_i$ has a cost $V_i$. We
assume that the verification mechanism has a perfect recall (it detects all errors). This guarantees 
that all taken checkpoints are correct, because they are always preceded by a verification. Similarly, 
we assume that no silent error can strike during the verification. 

The cost $V_i$ depends upon the detector and can thereby take a wide range of values. In this
work, we adopt a quite general formula and use

$$V_i(q_i) = u_i + \frac{v_i}{q_i} \tag{1}$$

to model the cost of verifying task $T_i$ when executed with $q_i$ processors, where $u_i$ and $v_i$ denote the
sequential and parallel cost of the verification, respectively. In the experiments (Section 5.2), we
instantiate the model with two cases:

• We use $u_i = \beta w_i$ and $v_i = 0$, where $\beta$ is a small parameter (around 1%). This means that
the cost of the verification is proportional to the sequential cost $w_i$ of $T_i$. It corresponds to
the case of data-oriented kernels processing large files and checksumming for verification in a
centralized location (hence sequentially) [5].

• We use $u_i = 0$ and $v_i = \beta w_i$. This means that the cost of the verification is proportional to the
parallel fraction of $T_i$. It corresponds to the same scenario as above, but where checksumming
is performed in parallel on all enrolled processors. 
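
The two instantiations of Equation (1) can be sketched as follows. Function names are illustrative; `beta` stands for the small parameter $\beta$ of the text.

```python
def verification_cost(u: float, v: float, q: int) -> float:
    """Equation (1): V(q) = u + v / q."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return u + v / q


def sequential_verification(work: float, beta: float, q: int) -> float:
    """First case: u = beta * w, v = 0 (centralized checksumming)."""
    return verification_cost(beta * work, 0.0, q)


def parallel_verification(work: float, beta: float, q: int) -> float:
    """Second case: u = 0, v = beta * w (checksumming on all processors)."""
    return verification_cost(0.0, beta * work, q)


if __name__ == "__main__":
    w, beta = 1000.0, 0.01
    for q in (10, 5):  # e.g. a full allocation and half of it
        print(q, sequential_verification(w, beta, q), parallel_verification(w, beta, q))
```

In the sequential case the cost does not depend on the number of processors, whereas in the parallel case it doubles when the processor count is halved.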


2.4 Checkpointing 

The output of each task $T_i$ can be checkpointed in time $C_i$. We use a two-level checkpoint protocol
where the checkpoint is first saved locally (memory checkpoint) before being transferred to a slower 
but reliable storage like a filesystem (disk checkpoint). The memory checkpoint will be lost when 
a fail-stop error strikes a processor (and its local data), whereas the disk checkpoint will always 
remain available to restart the application. 

When a fail-stop error strikes during the execution of $T_i$, we first incur a downtime $D$, and then
we must start the execution from the task following the last checkpoint. Hence, if $T_j$ is the last
checkpointed task, the execution starts again at task $T_{j+1}$, and the recovery cost is $R^D_{j+1}$, which
amounts to reading the disk checkpoint of task $T_j$. When a silent error is detected at the end of $T_i$
by the verification mechanism, we also roll back to the last checkpointed task $T_j$, but (i) we do not
pay the downtime $D$; and (ii) the recovery cost is $R^M_{j+1}$, which amounts to reading the memory
checkpoint of task $T_j$ (hence at a much smaller cost than for a fail-stop error).

The checkpoint cost $C_i$, and both recovery costs $R^D_{i+1}$ and $R^M_{i+1}$, clearly depend upon the checkpoint
protocol and storage medium, as well as upon the number $q_i$ of enrolled processors. In this
work, we adopt a quite general formula for checkpoint times and use

$$C_i(q_i) = a_i + \frac{b_i}{q_i} + c_i q_i \tag{2}$$

to model the time to save a checkpoint after $T_i$ executed with $q_i$ processors. Here, $a_i + \frac{b_i}{q_i}$ represents
the I/O overhead to write the task output file $M_i$ to the storage medium. For in-memory
checkpointing [48], $a_i + \frac{b_i}{q_i}$ is the communication time, in which $a_i$ denotes the latency to access
the storage system; then we have $\frac{b_i}{q_i} = \frac{M_i}{\tau_{net} q_i}$, where $\tau_{net}$ is the network bandwidth (each processor
stores $\frac{M_i}{q_i}$ data items). For coordinated checkpointing to stable storage, there are two cases: if the
I/O bottleneck is the storage system's bandwidth, then $a_i = \beta + \frac{M_i}{\tau_{io}}$ and $b_i = 0$, where $\beta$ is a
start-up time and $\tau_{io}$ is the I/O bandwidth; otherwise, if the I/O bottleneck is the network latency,
we retrieve the same formula as for in-memory checkpointing. Finally, $c_i q_i$ represents the message
passing overhead that grows linearly with the number of processors, in order for all processors to
reach a global consistent state [19, 50].

For the cost of recovery (from memory or from disk), we assume similar formulas: 


$$R^M_i(q_i) = a^M_i + \frac{b^M_i}{q_i} + c^M_i q_i, \qquad R^D_i(q_i) = a^D_i + \frac{b^D_i}{q_i} + c^D_i q_i. \tag{3}$$


The coefficients depend on the type of recovery: again, a memory recovery is much faster than a disk 
recovery. If we further assume that reading and writing from/to the same storage medium (memory 
or disk) have same cost, we have 


$$C_i(q_i) = R^D_{i+1}(q_i) + R^M_{i+1}(q_i)$$

since recovering for task $T_{i+1}$ amounts to reading the checkpoint from task $T_i$.

Finally, we assume that there is a fictitious task $T_0$ of zero weight ($w_0 = 0$) that is always
checkpointed, so that $R^D_1(q_1)$ represents the time for I/O input from the external world. Similarly,
we systematically checkpoint the last task $T_n$, in order to account for the I/O output time $C_n(q_n)$.
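
The cost model of Equations (2) and (3) can be sketched as a small data structure. The class and function names below are illustrative, and the numeric coefficients in the usage example are arbitrary placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOCost:
    """Coefficients (a, b, c) of the generic cost model a + b/q + c*q."""
    a: float
    b: float
    c: float

    def cost(self, q: int) -> float:
        if q < 1:
            raise ValueError("q must be at least 1")
        return self.a + self.b / q + self.c * q


def in_memory_cost(latency: float, output_size: float,
                   net_bandwidth: float, sync_overhead: float = 0.0) -> IOCost:
    """In-memory case: a = latency, b = M / tau_net, c = per-processor overhead."""
    return IOCost(a=latency, b=output_size / net_bandwidth, c=sync_overhead)


def checkpoint_from_recoveries(disk: IOCost, memory: IOCost) -> IOCost:
    """Two-level checkpoint with symmetric read/write costs:
    C(q) = R^D(q) + R^M(q), so the coefficients simply add up."""
    return IOCost(disk.a + memory.a, disk.b + memory.b, disk.c + memory.c)


if __name__ == "__main__":
    memory = in_memory_cost(latency=0.5, output_size=1000.0, net_bandwidth=100.0)
    disk = IOCost(a=5.0, b=200.0, c=0.01)  # placeholder coefficients
    ckpt = checkpoint_from_recoveries(disk, memory)
    print(memory.cost(10), disk.cost(10), ckpt.cost(10))
```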


2.5 Replication 

When executing a task, we envision two possibilities: either the task is not replicated, or it is 
replicated. To explain the impact of replication, we momentarily assume that we consider fail-stop 
errors only. Then we return to the scenario with both fail-stop and silent errors. 

With fail-stop errors only, consider a task $T_i$, and assume for simplicity that the predecessor
$T_{i-1}$ of $T_i$ has been checkpointed. If it is not the case, i.e., if the predecessor $T_{i-1}$ of $T_i$ is not
checkpointed, we have to roll back to the last checkpointed task, say $T_k$ where $k < i - 1$, whenever
an error strikes, and re-execute the entire segment from $T_{k+1}$ to $T_i$ instead of just $T_i$.

Without replication, a single copy of $T_i$ is executed on the entire platform, hence with $q_i = p$
processors. Then we let $\mathbb{E}^{\text{norep}}(i)$ denote the expected execution time of $T_i$ when accounting for
errors. We attempt a first execution, which takes $T^{\text{norep}}_i = w_i \left(\alpha_i + \frac{1 - \alpha_i}{p}\right)$ if no fail-stop error
strikes. But if a fail-stop error does strike, we must account for the time that has been lost (between
the beginning of the execution and the fail-stop error), then incur a downtime $D$, perform a recovery $R_i(p)$
(since we use the entire platform for $T_i$), and then re-execute $T_i$ from scratch. Similarly, if we decide
to checkpoint after $T_i$, we need $C_i(p)$ time units. We explain how to compute $\mathbb{E}^{\text{norep}}(i)$ in Section 3.

With replication, two copies of $T_i$ are executed in parallel, each with $q_i = \frac{p}{2}$ processors. If no fail-stop
error strikes, both copies finish execution in time $T^{\text{rep}}_i = w_i \left(\alpha_i + \frac{2(1 - \alpha_i)}{p}\right)$, since each copy uses
$\frac{p}{2}$ processors. If a fail-stop error strikes one copy, we proceed as before, account for the downtime $D$,
recover (in time $R_i(\frac{p}{2})$ now), and restart execution with that copy. Then there are two cases: (i) if
the second copy successfully completes its first execution, the fail-stop error has no impact and the
execution time remains the same as the error-free execution time; (ii) however, if the second copy
also fails to execute, we resume its execution, and iterate until one copy successfully completes. Of
course, case (ii) is less likely to happen than case (i), which explains why replication can be useful.
Finally, if we decide to checkpoint after $T_i$, the first successful copy will take the checkpoint in time
$C_i(\frac{p}{2})$.

Replication raises several complications in terms of checkpoint and recovery costs. When a
replicated task $T_i$ is checkpointed, we can enforce that only one copy (the first one to complete
execution) writes the output data onto the storage medium, hence with a cost $C_i(\frac{p}{2})$, as stated
above. Similarly, when a single copy of a replicated task $T_i$ performs a recovery after a fail-stop
error, the cost would be $R_i(\frac{p}{2})$. However, in the unlikely event where both copies are struck by a
fail-stop error at close time instances, their recoveries would overlap, and the cost can vary anywhere
between $R_i(\frac{p}{2})$ and $2R_i(\frac{p}{2})$, depending upon the amount of contention, the length of the overlap and
where the I/O bottleneck lies. We will experimentally evaluate the impact of the recovery cost with
replication in Section 5.1. For simplicity, in the rest of the paper, we use $C^{\text{rep}}_i$ for the checkpoint
cost of $T_i$ when it is replicated, and $C^{\text{norep}}_i$ when it is not. Similarly, we use $R^{D,\text{rep}}_i$ or $R^{M,\text{rep}}_i$ for the
recovery costs (disk or memory) when $T_i$ is replicated, and $R^{D,\text{norep}}_i$ or $R^{M,\text{norep}}_i$ when it is not. Note
that the recovery cost of $T_i$ depends upon whether it is replicated or not, but does not depend upon
whether the checkpointed task $T_{i-1}$ was replicated or not, since we need to read the same file from the
storage medium in both cases. The values of $C^{\text{rep}}_i$ and $C^{\text{norep}}_i$ can be instantiated from Equation (2)
and those of $R^{D,\text{rep}}_i$, $R^{D,\text{norep}}_i$, $R^{M,\text{rep}}_i$ and $R^{M,\text{norep}}_i$ can be instantiated from Equation (3). We let
$\mathbb{E}^{\text{rep}}(i)$ denote the expected execution time of $T_i$ with replication and when accounting for fail-stop
errors, when $T_{i-1}$ is checkpointed. The derivation of $\mathbb{E}^{\text{rep}}(i)$ is significantly more complicated than
for $\mathbb{E}^{\text{norep}}(i)$ and represents a new contribution of this work, detailed in Section 3.2.

We now detail the impact of replication when both fail-stop and silent errors can strike. First,
we have to state how the verification cost $V_i$ of task $T_i$ depends upon whether $T_i$ is replicated or
not. For the analysis, we keep a general model and let $V^{\text{rep}}_i$ be the cost when $T_i$ is replicated, and
$V^{\text{norep}}_i$ when it is not. However, as explained later in the experimental evaluation (Section 5.2),
we use two different instantiations of Equation (1), which directly give the two (possibly different)
values of $V^{\text{rep}}_i$ and $V^{\text{norep}}_i$ as a function of parameter $\beta$.

Next, consider again a task $T_i$, and still assume for simplicity that the predecessor $T_{i-1}$ of $T_i$
has been checkpointed. The impact of fail-stop errors is the same as before, and depends upon how
many replicas of $T_i$ are executed. The only difference is that the fail-stop error can now strike either
during the execution of a replica or during its verification. But if no fail-stop error strikes, we still
have to perform the verification to detect a possible silent error, whose probability depends upon the
error-free execution time of that replica. Recall that no silent error can strike during the verification
(but a fail-stop error can). If a silent error is detected, we have to re-execute the task, in which
case we recover from the memory checkpoint instead of from the disk checkpoint.

Finally, we extend the definition of $\mathbb{E}^{\text{norep}}(i)$ and $\mathbb{E}^{\text{rep}}(i)$ to account for both fail-stop and silent
errors, when $T_{i-1}$ is checkpointed. We explain how to compute both quantities in Section 3.


2.6 Optimization problem 

The objective of this work is to minimize the expected makespan of the linear workflow in the presence 
of fail-stop and silent errors. For each task, we have four choices: either we replicate the task or 
not, and either we checkpoint it or not. More formally, for each task $T_i$ we need to decide: (i) if it is
checkpointed or not; and (ii) if it is replicated or not (meaning that there are $4^n$ combinations for
the whole workflow), with the objective to minimize the total execution time of the workflow. We
point out that none of these decisions can be made locally. Instead, we need to account for previous 
decisions and optimize globally. Our major contribution of this work is to provide an optimal 
dynamic programming algorithm to solve this problem, which we denote as ChainsRepCkpt. 

We point out that ChainsCkpt, the simpler problem without replication, i.e., optimally placing 
checkpoints for a chain of tasks, has been extensively studied. The first dynamic programming 
algorithm to solve ChainsCkpt appears in the pioneering paper of Toueg and Babaoglu [43] back in 
1984, for the scenario with fail-stop errors only (see Section 6 on related work for further references). 
Adding replication significantly complicates the solution. Here is an intuitive explanation: When 
the algorithm recursively considers a segment of tasks from $T_i$ to $T_j$, where $T_{i-1}$ and $T_j$ are both
checkpointed and no intermediate task $T_k$, $i \le k < j$, is checkpointed, there are many cases to
consider to account for possible different values in: (i) execution time, since some tasks in the
segment may be replicated; (ii) checkpoint, whose cost depends upon whether $T_j$ is replicated or
not; and (iii) recovery, whose cost depends upon whether $T_i$ is replicated or not. We provide all
details in Section 4. 
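
To give a sense of why exhaustive search is not an option, the sketch below enumerates every (checkpoint, replicate) assignment for a chain of $n$ tasks. The function name is illustrative, and the enumeration is only meant to show the $4^n$ growth of the decision space, not to serve as a solution method.

```python
from itertools import product


def enumerate_strategies(n: int):
    """Yield every assignment of (checkpoint, replicate) flags to n tasks.

    Each task has 4 options, so the generator yields 4**n tuples.
    """
    per_task = list(product((False, True), repeat=2))  # 4 choices per task
    yield from product(per_task, repeat=n)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(n, 4 ** n)  # size of the search space, without enumerating it
```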





International Journal of Networking and Computing 


3 Computing $\mathbb{E}^{\text{norep}}(i)$ and $\mathbb{E}^{\text{rep}}(i)$

This section details how to compute the expected time needed to execute a task $T_i$, assuming that
the predecessor of $T_i$ has been checkpointed. Hence, we need to re-execute only $T_i$ when an error
strikes. We explain how to deal with the general case of re-executing a segment of tasks, some of
them replicated, in Section 4. Here, we start with the case where $T_i$ is not replicated. It is already
known how to compute $\mathbb{E}^{\text{norep}}(i)$ [25, 6], but we present this case to help the reader follow the
derivation in Section 3.2 for the case where $T_i$ is replicated, which is new and much more involved.

3.1 Computing $\mathbb{E}^{\text{norep}}(i)$

To compute $\mathbb{E}^{\text{norep}}(i)$, the average execution time of $T_i$ with $p$ processors without replication, we
conduct a case analysis:

• Either a fail-stop error strikes during the execution of the task and its verification (lasting
$T^{\text{norep}}_i + V^{\text{norep}}_i$), and in this case we lose some work and need to re-execute the task, recovering
from a disk checkpoint;

• Or there is no fail-stop error, and in this case the verification indicates whether there has
been a silent error or not:

— If a silent error is detected, we need to re-execute the task right after the verification, 
recovering from a memory checkpoint; 

— Otherwise the execution has been successful. 

This leads to the following recursive formula: 


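$$\begin{aligned}
\mathbb{E}^{\text{norep}}(i) ={}& \mathbb{P}(X^F_p \le T^{\text{norep}}_i + V^{\text{norep}}_i) \left( T^{\text{norep}}_{\text{lost}}(T^{\text{norep}}_i + V^{\text{norep}}_i) + D + R^{D,\text{norep}}_i + \mathbb{E}^{\text{norep}}(i) \right) \\
&+ \left(1 - \mathbb{P}(X^F_p \le T^{\text{norep}}_i + V^{\text{norep}}_i)\right) \left( T^{\text{norep}}_i + V^{\text{norep}}_i + \mathbb{P}(X^S_p \le T^{\text{norep}}_i) \left(D + R^{M,\text{norep}}_i + \mathbb{E}^{\text{norep}}(i)\right) \right),
\end{aligned} \tag{4}$$

where $\mathbb{P}(X^F_p \le t)$ is the probability of having a fail-stop error on one of the $p$ processors before time
$t$, i.e., $\mathbb{P}(X^F_p \le t) = 1 - e^{-\lambda^F_{\text{ind}} p t}$, and $\mathbb{P}(X^S_p \le t)$ is the probability of having a silent error on one of
the $p$ processors before time $t$, i.e., $\mathbb{P}(X^S_p \le t) = 1 - e^{-\lambda^S_{\text{ind}} p t}$. The time lost when an error strikes is
the expectation of the random variable $X^F_p$, knowing that the error struck before the end of the task
and its verification. We compute it as follows: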


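$$\begin{aligned}
T^{\text{norep}}_{\text{lost}}(T^{\text{norep}}_i + V^{\text{norep}}_i) &= \int_0^{\infty} x \, \mathbb{P}(X^F_p = x \mid X^F_p \le T^{\text{norep}}_i + V^{\text{norep}}_i) \, dx \\
&= \frac{1}{\mathbb{P}(X^F_p \le T^{\text{norep}}_i + V^{\text{norep}}_i)} \int_0^{T^{\text{norep}}_i + V^{\text{norep}}_i} x \, \mathbb{P}(X^F_p = x) \, dx \\
&= \frac{1}{\mathbb{P}(X^F_p \le T^{\text{norep}}_i + V^{\text{norep}}_i)} \int_0^{T^{\text{norep}}_i + V^{\text{norep}}_i} x \, \frac{d\,\mathbb{P}(X^F_p \le x)}{dx} \, dx.
\end{aligned}$$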


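After integration, we get the formula:

$$T^{\text{norep}}_{\text{lost}}(T^{\text{norep}}_i + V^{\text{norep}}_i) = \frac{1}{\lambda^F_{\text{ind}} p} - \frac{T^{\text{norep}}_i + V^{\text{norep}}_i}{e^{\lambda^F_{\text{ind}} p \,(T^{\text{norep}}_i + V^{\text{norep}}_i)} - 1}. \tag{5}$$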
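
As a sanity check of Equation (5), the sketch below compares the closed form with a direct numerical integration of the conditional expectation. The function names, the midpoint rule and the numeric values are illustrative choices.

```python
import math


def t_lost_norep(rate_fs: float, p: int, duration: float) -> float:
    """Closed form of Equation (5): 1/(lambda*p) - d / (exp(lambda*p*d) - 1)."""
    lam = rate_fs * p
    return 1.0 / lam - duration / math.expm1(lam * duration)


def t_lost_norep_numeric(rate_fs: float, p: int, duration: float,
                         steps: int = 20_000) -> float:
    """Midpoint-rule integration of x * density(x) on [0, d], divided by
    P(X <= d), where density(x) = lambda*p * exp(-lambda*p*x)."""
    lam = rate_fs * p
    h = duration / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x * lam * math.exp(-lam * x) * h
    return total / -math.expm1(-lam * duration)


if __name__ == "__main__":
    args = (1e-6, 1000, 500.0)  # individual rate, processors, T + V
    print(t_lost_norep(*args), t_lost_norep_numeric(*args))
```

Both values fall slightly below half of the interval length, as expected for an exponential law conditioned on striking within the interval.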
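Replacing the left hand side term of Equation (5) in Equation (4) and solving, we derive:

$$\begin{aligned}
\mathbb{E}^{\text{norep}}(i) ={}& \left( \frac{1}{\lambda^F_{\text{ind}} p} + D + R^{D,\text{norep}}_i \right) e^{p \left( (\lambda^F_{\text{ind}} + \lambda^S_{\text{ind}}) T^{\text{norep}}_i + \lambda^F_{\text{ind}} V^{\text{norep}}_i \right)} \\
&- \left( \frac{1}{\lambda^F_{\text{ind}} p} + R^{D,\text{norep}}_i - R^{M,\text{norep}}_i \right) e^{\lambda^S_{\text{ind}} p \, T^{\text{norep}}_i} - \left( D + R^{M,\text{norep}}_i \right).
\end{aligned} \tag{6}$$

Recall that $T^{\text{norep}}_i = w_i \left(\alpha_i + \frac{1 - \alpha_i}{p}\right)$ in Equation (6). Finally, if we decide to checkpoint $T_i$, we
simply add $C^{\text{norep}}_i$ to $\mathbb{E}^{\text{norep}}(i)$.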
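
The closed form of Equation (6) can be cross-checked against the recursive definition of Equation (4), which is linear in the unknown expectation and can therefore be solved directly. The sketch below follows both equations term by term as stated in this section; function names and numeric values are illustrative.

```python
import math


def expected_time_norep(T, V, D, R_disk, R_mem, lam_f, lam_s, p):
    """Closed form of Equation (6)."""
    a = lam_f * p
    return ((1.0 / a + D + R_disk) * math.exp(p * ((lam_f + lam_s) * T + lam_f * V))
            - (1.0 / a + R_disk - R_mem) * math.exp(lam_s * p * T)
            - (D + R_mem))


def expected_time_norep_from_recursion(T, V, D, R_disk, R_mem, lam_f, lam_s, p):
    """Solve Equation (4), E = const + coeff * E, for E."""
    a = lam_f * p
    tau = T + V
    p_fail = -math.expm1(-a * tau)            # fail-stop error before T + V
    p_silent = -math.expm1(-lam_s * p * T)    # silent error during T
    t_lost = 1.0 / a - tau / math.expm1(a * tau)  # Equation (5)
    const = (p_fail * (t_lost + D + R_disk)
             + (1.0 - p_fail) * (tau + p_silent * (D + R_mem)))
    coeff = p_fail + (1.0 - p_fail) * p_silent
    return const / (1.0 - coeff)


if __name__ == "__main__":
    params = dict(T=500.0, V=5.0, D=60.0, R_disk=30.0, R_mem=3.0,
                  lam_f=1e-6, lam_s=2e-6, p=1000)
    print(expected_time_norep(**params))
    print(expected_time_norep_from_recursion(**params))
```

The two functions should agree up to floating-point rounding, and both exceed the error-free duration of the task and its verification.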


3.2 Computing $\mathbb{E}^{\text{rep}}(i)$

We now discuss the case where $T_i$ is replicated; each copy executes with $\frac{p}{2}$ processors. To compute
$\mathbb{E}^{\text{rep}}(i)$, the expected execution time of $T_i$ with replication, we conduct a case analysis similar to
that of Section 3.1:

• Either two fail-stop errors strike before the end of the task and its verification (lasting
$T^{\text{rep}}_i + V^{\text{rep}}_i$), with one fail-stop error striking each copy. Then we have lost some work and need to
re-execute the task from a disk checkpoint;

• Or at least one copy is not hit by any fail-stop error. Then we need to account for two different 
cases in the analysis: 

— Both copies have survived: then we need to re-execute the task (recovering from a memory 
checkpoint) only if both copies are hit by a silent error. 

— Only one replica survived: then we need to re-execute the task if this replica is hit by a 
silent error. 


This leads to the following formula: 


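$$\begin{aligned}
\mathbb{E}^{\text{rep}}(i) ={}& \mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)^2 \left( T^{\text{rep}}_{\text{lost}}(T^{\text{rep}}_i + V^{\text{rep}}_i) + D + R^{D,\text{rep}}_i + \mathbb{E}^{\text{rep}}(i) \right) \\
&+ \left(1 - \mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)^2\right) \left(T^{\text{rep}}_i + V^{\text{rep}}_i\right) \\
&+ \Big( 2\left(1 - \mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)\right) \mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)\, \mathbb{P}(Y^S_p \le T^{\text{rep}}_i) \\
&\quad + \left(1 - \mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)\right)^2 \mathbb{P}(Y^S_p \le T^{\text{rep}}_i)^2 \Big) \left(D + R^{M,\text{rep}}_i + \mathbb{E}^{\text{rep}}(i)\right),
\end{aligned} \tag{7}$$

where $\mathbb{P}(Y^F_p \le t)$ is the probability of having a fail-stop error on one replica of $\frac{p}{2}$ processors before time
$t$, i.e., $\mathbb{P}(Y^F_p \le t) = 1 - e^{-\frac{\lambda^F_{\text{ind}} p}{2} t}$, and $\mathbb{P}(Y^S_p \le t)$ is the probability of having a silent error on one
replica of $\frac{p}{2}$ processors before time $t$, i.e., $\mathbb{P}(Y^S_p \le t) = 1 - e^{-\frac{\lambda^S_{\text{ind}} p}{2} t}$. The first line of Equation (7)
corresponds to the case where both replicas are hit by a fail-stop error; the second line accounts for
the time spent when at least one replica survives. The last two lines correspond to the two cases
when we need to re-execute the task after the detection of a silent error (one replica alive for line 3,
two replicas alive for line 4 of Equation (7)).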

The time lost when both copies fail can be computed in a similar way as before: 


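$$T^{\text{rep}}_{\text{lost}}(T^{\text{rep}}_i + V^{\text{rep}}_i) = \frac{1}{\mathbb{P}(Y^F_p \le T^{\text{rep}}_i + V^{\text{rep}}_i)^2} \int_0^{T^{\text{rep}}_i + V^{\text{rep}}_i} x \, \frac{d\,\mathbb{P}(Y^F_p \le x)^2}{dx} \, dx.$$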


After computation and verification using a Maple sheet, we obtain the following result: 


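$$T^{\text{rep}}_{\text{lost}}(T^{\text{rep}}_i + V^{\text{rep}}_i) = \frac{\left(-2\lambda^F_{\text{ind}} p \,(T^{\text{rep}}_i + V^{\text{rep}}_i) - 4\right) e^{-\frac{\lambda^F_{\text{ind}} p}{2}(T^{\text{rep}}_i + V^{\text{rep}}_i)} + \left(\lambda^F_{\text{ind}} p \,(T^{\text{rep}}_i + V^{\text{rep}}_i) + 1\right) e^{-\lambda^F_{\text{ind}} p \,(T^{\text{rep}}_i + V^{\text{rep}}_i)} + 3}{\left(e^{-\frac{\lambda^F_{\text{ind}} p}{2}(T^{\text{rep}}_i + V^{\text{rep}}_i)} - 1\right)^2 \lambda^F_{\text{ind}} p}. \tag{8}$$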
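
Equation (8) can be checked numerically in the same spirit as for the non-replicated case. The sketch below assumes that the time lost is the expected time at which the second of the two replicas fails, given that both fail within the interval; the function names, the midpoint rule and the numeric values are illustrative.

```python
import math


def t_lost_rep(rate_fs: float, p: int, duration: float) -> float:
    """Closed form of Equation (8), with duration = T_rep + V_rep."""
    a = rate_fs * p
    e_half = math.exp(-a * duration / 2.0)
    e_full = math.exp(-a * duration)
    numerator = (-2.0 * a * duration - 4.0) * e_half + (a * duration + 1.0) * e_full + 3.0
    return numerator / ((e_half - 1.0) ** 2 * a)


def t_lost_rep_numeric(rate_fs: float, p: int, duration: float,
                       steps: int = 20_000) -> float:
    """Midpoint-rule integration of x * d/dx[F(x)^2] on [0, d], divided by
    F(d)^2, where F is the failure CDF of one replica on p/2 processors."""
    b = rate_fs * p / 2.0
    h = duration / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        cdf = -math.expm1(-b * x)
        pdf = b * math.exp(-b * x)
        total += x * 2.0 * cdf * pdf * h
    return total / math.expm1(-b * duration) ** 2


if __name__ == "__main__":
    args = (1e-6, 1000, 500.0)
    print(t_lost_rep(*args), t_lost_rep_numeric(*args))
```

Note that the closed form subtracts nearly equal quantities when the product of the error rate and the interval length is very small, so a series expansion may be preferable in that regime.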
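Replacing the left hand side term of Equation (8) in Equation (7) and solving, we get:

$$\begin{aligned}
\mathbb{E}^{\text{rep}}(i) ={}& - \frac{\left(4 + 2\lambda^F_{\text{ind}} p \,(R^{D,\text{rep}}_i - R^{M,\text{rep}}_i)\right) e^{p \left( \frac{\lambda^F_{\text{ind}}}{2}(T^{\text{rep}}_i + V^{\text{rep}}_i) + \lambda^S_{\text{ind}} T^{\text{rep}}_i \right)}}{\left(2 e^{\frac{p}{2} \left( \lambda^F_{\text{ind}}(T^{\text{rep}}_i + V^{\text{rep}}_i) + \lambda^S_{\text{ind}} T^{\text{rep}}_i \right)} - 1\right) \lambda^F_{\text{ind}} p} \\
&+ \frac{\left(1 + \lambda^F_{\text{ind}} p \,(R^{D,\text{rep}}_i - R^{M,\text{rep}}_i)\right) e^{\lambda^S_{\text{ind}} p \, T^{\text{rep}}_i}}{\left(2 e^{\frac{p}{2} \left( \lambda^F_{\text{ind}}(T^{\text{rep}}_i + V^{\text{rep}}_i) + \lambda^S_{\text{ind}} T^{\text{rep}}_i \right)} - 1\right) \lambda^F_{\text{ind}} p} \\
&+ \frac{\left(3 + \lambda^F_{\text{ind}} p \,(D + R^{D,\text{rep}}_i)\right) e^{p \left( \lambda^F_{\text{ind}}(T^{\text{rep}}_i + V^{\text{rep}}_i) + \lambda^S_{\text{ind}} T^{\text{rep}}_i \right)}}{\left(2 e^{\frac{p}{2} \left( \lambda^F_{\text{ind}}(T^{\text{rep}}_i + V^{\text{rep}}_i) + \lambda^S_{\text{ind}} T^{\text{rep}}_i \right)} - 1\right) \lambda^F_{\text{ind}} p} - \left(D + R^{M,\text{rep}}_i\right).
\end{aligned} \tag{9}$$

Recall that $T^{\text{rep}}_i = w_i \left(\alpha_i + \frac{2(1 - \alpha_i)}{p}\right)$ in Equation (9). Finally, if we decide to checkpoint $T_i$, we simply
add $C^{\text{rep}}_i$ to $\mathbb{E}^{\text{rep}}(i)$.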
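
As with the non-replicated case, the closed form of Equation (9) can be cross-checked by solving the linear recursion of Equation (7) directly, using Equation (8) for the lost time. The sketch follows the equations of this section term by term; function names and numeric values are illustrative.

```python
import math


def expected_time_rep(T, V, D, R_disk, R_mem, lam_f, lam_s, p):
    """Closed form of Equation (9); T and V are the replicated durations."""
    a = lam_f * p
    tau = T + V
    denom = (2.0 * math.exp(p * (lam_f * tau + lam_s * T) / 2.0) - 1.0) * a
    numer = (-(4.0 + 2.0 * a * (R_disk - R_mem)) * math.exp(p * (lam_f * tau / 2.0 + lam_s * T))
             + (1.0 + a * (R_disk - R_mem)) * math.exp(lam_s * p * T)
             + (3.0 + a * (D + R_disk)) * math.exp(p * (lam_f * tau + lam_s * T)))
    return numer / denom - (D + R_mem)


def expected_time_rep_from_recursion(T, V, D, R_disk, R_mem, lam_f, lam_s, p):
    """Solve Equation (7), E = const + coeff * E, for E."""
    a = lam_f * p
    tau = T + V
    pf = -math.expm1(-a * tau / 2.0)          # one replica hit by a fail-stop error
    ps = -math.expm1(-lam_s * p * T / 2.0)    # one replica hit by a silent error
    e_half, e_full = math.exp(-a * tau / 2.0), math.exp(-a * tau)
    t_lost = (((-2.0 * a * tau - 4.0) * e_half + (a * tau + 1.0) * e_full + 3.0)
              / ((e_half - 1.0) ** 2 * a))    # Equation (8)
    both_fail = pf ** 2
    silent_retry = 2.0 * (1.0 - pf) * pf * ps + (1.0 - pf) ** 2 * ps ** 2
    const = (both_fail * (t_lost + D + R_disk)
             + (1.0 - both_fail) * tau
             + silent_retry * (D + R_mem))
    return const / (1.0 - both_fail - silent_retry)


if __name__ == "__main__":
    params = dict(T=900.0, V=10.0, D=60.0, R_disk=30.0, R_mem=3.0,
                  lam_f=1e-6, lam_s=2e-6, p=1000)
    print(expected_time_rep(**params))
    print(expected_time_rep_from_recursion(**params))
```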


4 Optimal dynamic programming algorithm 


In this section, we provide an optimal dynamic programming (DP) algorithm to solve the
ChainsRepCkpt problem for a linear chain of n tasks.

Theorem 1. The optimal solution to the ChainsRepCkpt problem can be obtained using a dynamic 
programming algorithm in $O(n^2)$ time, where $n$ is the number of tasks in the chain.

Proof. The algorithm recursively computes the expectation of the optimal time required to execute
tasks $T_1$ to $T_i$ and then checkpoint $T_i$. As already mentioned, we need to distinguish two cases,
according to whether $T_i$ is replicated or not, because the cost of the final checkpoint depends upon
this decision. Hence, we recursively compute two different functions:

• $T^{\text{rep}}_{\text{opt}}(i)$, the expectation of the optimal time required to execute tasks $T_1$ to $T_i$, knowing that
$T_i$ is replicated;

• $T^{\text{norep}}_{\text{opt}}(i)$, the expectation of the optimal time required to execute tasks $T_1$ to $T_i$, knowing that
$T_i$ is not replicated.

Note that checkpoint time is not included in $T^{\text{rep}}_{\text{opt}}(i)$ nor $T^{\text{norep}}_{\text{opt}}(i)$. The solution to ChainsRepCkpt
will be given by

$$\min \left\{ T^{\text{rep}}_{\text{opt}}(n) + C^{\text{rep}}_n, \; T^{\text{norep}}_{\text{opt}}(n) + C^{\text{norep}}_n \right\}. \tag{10}$$

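We start with the computation of $T^{\text{rep}}_{\text{opt}}(j)$ for $1 \le j \le n$, hence assuming that the last task $T_j$ is
replicated. We express $T^{\text{rep}}_{\text{opt}}(j)$ recursively as follows:

$$T^{\text{rep}}_{\text{opt}}(j) = \min_{1 \le i < j} \left\{ \begin{array}{l}
T^{\text{rep}}_{\text{opt}}(i) + C^{\text{rep}}_i + T^{\text{rep},\text{rep}}_{NC}(i+1, j), \\
T^{\text{rep}}_{\text{opt}}(i) + C^{\text{rep}}_i + T^{\text{norep},\text{rep}}_{NC}(i+1, j), \\
T^{\text{norep}}_{\text{opt}}(i) + C^{\text{norep}}_i + T^{\text{rep},\text{rep}}_{NC}(i+1, j), \\
T^{\text{norep}}_{\text{opt}}(i) + C^{\text{norep}}_i + T^{\text{norep},\text{rep}}_{NC}(i+1, j), \\
R^{D,\text{rep}}_1 + T^{\text{rep},\text{rep}}_{NC}(1, j), \\
R^{D,\text{norep}}_1 + T^{\text{norep},\text{rep}}_{NC}(1, j)
\end{array} \right\} \tag{11}$$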

In Equation (11), $T_i$ corresponds to the last checkpointed task before $T_j$, and we try all possible
locations $T_i$ for taking a checkpoint before $T_j$. The first four lines correspond to the case where
there is indeed an intermediate task $T_i$ between $T_1$ and $T_j$ that is checkpointed, while the last two
lines correspond to the case where no checkpoint at all is taken until after $T_j$.

The first two lines of Equation (11) apply to the case where $T_i$ is replicated. Line 1 is for the case
when $T_{i+1}$ is replicated, and line 2 when it is not. In the first line of Equation (11), $T^{\text{rep},\text{rep}}_{NC}(i+1, j)$
denotes the optimal time to execute tasks $T_{i+1}$ to $T_j$ without any intermediate checkpoint, knowing
that $T_i$ is checkpointed, and both $T_{i+1}$ and $T_j$ are replicated. If $T_{i+1}$ is not replicated, we use the
second line of Equation (11), where $T^{\text{norep},\text{rep}}_{NC}(i+1, j)$ is the counterpart of $T^{\text{rep},\text{rep}}_{NC}(i+1, j)$, except
that it assumes that $T_{i+1}$ is not replicated. This information on $T_{i+1}$ (replicated or not) is needed
to compute the recovery cost when executing tasks $T_{i+1}$ to $T_j$ and experiencing an error.

Lines 3 and 4 apply to the case where $T_i$ is not replicated, with similar notation as before. In
the first four lines, no task between $T_{i+1}$ and $T_{j-1}$ is checkpointed, hence the notation NC for no
checkpoint.

If no checkpoint at all is taken before $T_j$ (this corresponds to the case $i = 0$), we use the last
two lines of Equation (11): we include the cost to read the initial input, which depends on whether $T_1$
is replicated (in line 5) or not (in line 6).

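We have a very similar equation to express $T_{opt}^{norep}(j)$ recursively, with intuitive notation:

$$
T_{opt}^{norep}(j) = \min_{1 \le i < j} \left\{
\begin{array}{l}
T_{opt}^{rep}(i) + C_i^{rep} + T_{NC}^{rep,norep}(i+1, j), \\
T_{opt}^{rep}(i) + C_i^{rep} + T_{NC}^{norep,norep}(i+1, j), \\
T_{opt}^{norep}(i) + C_i^{norep} + T_{NC}^{rep,norep}(i+1, j), \\
T_{opt}^{norep}(i) + C_i^{norep} + T_{NC}^{norep,norep}(i+1, j), \\
R_1^{D,rep} + T_{NC}^{rep,norep}(1, j), \\
R_1^{D,norep} + T_{NC}^{norep,norep}(1, j)
\end{array}
\right\} \qquad (12)
$$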


To synthesize the notation, we have defined $T_{NC}^{A,B}(i+1, j)$, with $A, B \in \{rep, norep\}$, as the
optimal time to execute tasks $T_{i+1}$ to $T_j$ without any intermediate checkpoint, knowing that $T_i$ is
checkpointed, $T_{i+1}$ is replicated if and only if $A = rep$, and $T_j$ is replicated if and only if $B = rep$.
In a nutshell, we have to account for the possible replication of the first task $T_{i+1}$ after the last
checkpoint, and of the last task $T_j$, hence the four cases.

It remains to compute $T_{NC}^{A,B}(i, j)$ for all $1 \le i, j \le n$ and $A, B \in \{rep, norep\}$. This is still
not easy, because we still have to decide which intermediate tasks should be replicated. In addition
to the status of $T_j$ (replicated or not, according to the value of $B$), the only thing we know so far is
that the only checkpoint that we can recover from while executing tasks $T_i$ to $T_j$ is the checkpoint
taken after task $T_{i-1}$, hence we need to re-execute from $T_i$ whenever an error strikes. Furthermore,
$T_i$ is replicated if and only if $A = rep$, hence we know the corresponding recovery costs, $R_i^{D,A}$ from disk and $R_i^{M,A}$ from memory.
Letting $T_{NC}^{A,B}(i, j) = 0$ whenever $i > j$, we can express $T_{NC}^{A,B}(i, j)$ for $1 \le i \le j \le n$ as follows:

$$T_{NC}^{A,B}(i, j) = \min \left\{ T_{NC}^{A,rep}(i, j-1),\; T_{NC}^{A,norep}(i, j-1) \right\} + T^{A,B}(j \mid i).$$


Here the new (and final) notation $T^{A,B}(j \mid i)$ is simply the time needed to execute task $T_j$,
knowing that an error during $T_j$ implies recovering from $T_i$. Indeed, to execute tasks $T_i$ to $T_j$, we
account recursively for the time to execute $T_i$ to $T_{j-1}$; $T_{i-1}$ is still checkpointed; $T_i$ is replicated if
and only if $A = rep$, $T_j$ is replicated if and only if $B = rep$, and we consider both cases whether $T_{j-1}$
is replicated or not. The time lost in case of an error during $T_j$ depends on whether $T_j$ is replicated
or not, and we need to restart from $T_i$ in case of error, hence the notation $T^{A,B}(j \mid i)$, representing
the expected execution time for task $T_j$ with or without replication (depending on $B$), given that
we need to restart from $T_i$ if there is an error (and $T_i$ is replicated if and only if $A = rep$).

The last step is hence to express these execution times. We start with the case where $T_j$ is not
replicated:

$$
\begin{aligned}
T^{A,norep}(j \mid i) ={}& \left(1 - e^{-\lambda^F_{ind} p \,(T_j^{norep} + V_j^{norep})}\right) \Big( T_{lost}^{norep}(T_j^{norep} + V_j^{norep}) + D + R_i^{D,A} \\
& \quad + \min \left\{ T_{NC}^{A,rep}(i, j-1),\; T_{NC}^{A,norep}(i, j-1) \right\} + T^{A,norep}(j \mid i) \Big) \\
& + e^{-\lambda^F_{ind} p \,(T_j^{norep} + V_j^{norep})} \Big( T_j^{norep} + V_j^{norep} + \left(1 - e^{-\lambda^S_{ind} p \, T_j^{norep}}\right) \big( D + R_i^{M,A} \\
& \quad + \min \left\{ T_{NC}^{A,rep}(i, j-1),\; T_{NC}^{A,norep}(i, j-1) \right\} + T^{A,norep}(j \mid i) \big) \Big).
\end{aligned}
$$

The term in $e^{-\lambda^F_{ind} p \,(T_j^{norep} + V_j^{norep})}$ represents the case without fail-stop error, where the execution
time is simply $T_j^{norep} + V_j^{norep}$. If a silent error is detected after the verification, we pay a downtime
and a memory recovery (with a cost depending on $A$). Next, we need to re-execute all the tasks
since the last checkpoint ($T_i$ to $T_{j-1}$) and take the minimal value obtained out of the execution
where $T_{j-1}$ is replicated or not; finally, we execute $T_j$ again (with a time $T^{A,norep}(j \mid i)$) from the last
checkpoint. When a fail-stop error strikes, we account for $T_{lost}^{norep}(T_j^{norep} + V_j^{norep})$, the time lost
within $T_j$, whose value is given by Equation (5). Then we pay a downtime and a disk recovery
(with a cost depending on $A$). Finally, we re-execute all the tasks from the last checkpoint, as in
the previous case.
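Because $T^{A,norep}(j \mid i)$ appears on both sides of the equation above, it has to be isolated before it can be evaluated. The following Python sketch shows one way to do this, under stated assumptions: `lam_f` and `lam_s` stand for the platform-level rates $\lambda^F_{ind} p$ and $\lambda^S_{ind} p$, `redo` stands for the term $\min\{T_{NC}^{A,rep}(i, j-1), T_{NC}^{A,norep}(i, j-1)\}$, and `time_lost` uses the standard expected loss for exponentially distributed errors, which is assumed here to play the role of Equation (5). All function and argument names are illustrative.

```python
import math


def time_lost(rate, window):
    """Expected time lost when a fail-stop error strikes within `window`.

    Standard result for exponentially distributed errors; assumed (not
    confirmed by this section) to correspond to Equation (5).
    """
    if rate == 0.0:
        return 0.0
    return 1.0 / rate - window / math.expm1(rate * window)


def expected_time_norep(lam_f, lam_s, T, V, D, R_disk, R_mem, redo):
    """Solve the self-referential equation for a non-replicated task.

    Writing X for the unknown, the equation reads
        X = (1 - q) * (lost + D + R_disk + redo + X)
            + q * (T + V + s * (D + R_mem + redo + X)),
    which is linear in X and gives X = numerator / (q * (1 - s)).
    """
    q = math.exp(-lam_f * (T + V))   # no fail-stop error during T + V
    s = -math.expm1(-lam_s * T)      # a silent error strikes during T
    fail_branch = (1.0 - q) * (time_lost(lam_f, T + V) + D + R_disk + redo)
    ok_branch = q * (T + V + s * (D + R_mem + redo))
    return (fail_branch + ok_branch) / (q * (1.0 - s))
```

With both rates set to zero the function returns exactly `T + V`, which is a convenient sanity check on the algebra.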

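The formula is similar when $T_j$ is replicated: the probability of error accounts for the fact
that we need to recover from a fail-stop error only if both replicas fail, and it accounts for the number
of living replicas in the case where a silent error is detected (see Section 3.2 for the details):

$$
\begin{aligned}
T^{A,rep}(j \mid i) ={}& \left(1 - e^{-\frac{\lambda^F_{ind} p}{2} (T_j^{rep} + V_j^{rep})}\right)^2 \Big( T_{lost}^{rep}(T_j^{rep} + V_j^{rep}) + D + R_i^{D,A} \\
& \quad + \min \left\{ T_{NC}^{A,rep}(i, j-1),\; T_{NC}^{A,norep}(i, j-1) \right\} + T^{A,rep}(j \mid i) \Big) \\
& + \left(1 - \left(1 - e^{-\frac{\lambda^F_{ind} p}{2} (T_j^{rep} + V_j^{rep})}\right)^2\right) \left(T_j^{rep} + V_j^{rep}\right) \\
& + \Big( 2 \left(1 - e^{-\frac{\lambda^F_{ind} p}{2} (T_j^{rep} + V_j^{rep})}\right) e^{-\frac{\lambda^F_{ind} p}{2} (T_j^{rep} + V_j^{rep})} \left(1 - e^{-\frac{\lambda^S_{ind} p}{2} T_j^{rep}}\right) \\
& \quad + e^{-\lambda^F_{ind} p \,(T_j^{rep} + V_j^{rep})} \left(1 - e^{-\frac{\lambda^S_{ind} p}{2} T_j^{rep}}\right)^2 \Big) \Big( D + R_i^{M,A} \\
& \quad + \min \left\{ T_{NC}^{A,rep}(i, j-1),\; T_{NC}^{A,norep}(i, j-1) \right\} + T^{A,rep}(j \mid i) \Big).
\end{aligned}
$$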


Note that the value of $T_{lost}^{rep}(T_j^{rep} + V_j^{rep})$ is given by Equation (8). Overall, we need to compute the
$O(n^2)$ intermediate values $T^{A,B}(j \mid i)$ and $T_{NC}^{A,B}(i, j)$ for $1 \le i, j \le n$ and $A, B \in \{rep, norep\}$, and
each of these takes constant time. There are $O(n)$ values $T_{opt}^{A}(i)$, for $1 \le i \le n$ and $A \in \{rep, norep\}$,
and each of these performs a minimum over at most $6n$ elements, hence it can be computed in $O(n)$ time. The
overall complexity is therefore $O(n^2)$. □
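The structure of the recurrences can be sketched in a few lines of Python. This is an illustrative transcription of Equations (10) to (12) and of the recurrence on $T_{NC}^{A,B}(i, j)$, not the authors' C++ simulator. The three callbacks (`task_time`, `ckpt_cost`, `input_cost`) are hypothetical hooks through which a caller supplies $T^{A,B}(j \mid i)$, $C_i^A$ and the cost of reading the initial input. The sketch also assumes that a segment made of a single task cannot have two different replication statuses, a guard that the text does not state explicitly.

```python
import math

STATES = ("rep", "norep")


def chains_rep_ckpt(n, task_time, ckpt_cost, input_cost):
    """Expected makespan of the best plan for tasks T_1..T_n.

    Hypothetical callbacks of this sketch:
      task_time(A, B, i, j, redo) -> T^{A,B}(j | i), where redo is the
          expected time to re-execute T_i..T_{j-1} after a rollback;
      ckpt_cost(A, i) -> C_i^A;
      input_cost(A)   -> cost of reading the initial input when T_1 has status A.
    """
    # t_nc[(A, B)][i][j] holds T_NC^{A,B}(i, j); entries with i > j stay at 0.
    t_nc = {(A, B): [[0.0] * (n + 1) for _ in range(n + 2)]
            for A in STATES for B in STATES}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            for A in STATES:
                redo = min(t_nc[(A, "rep")][i][j - 1],
                           t_nc[(A, "norep")][i][j - 1])
                for B in STATES:
                    if i == j and A != B:
                        # Assumption: one task cannot be both replicated and not.
                        t_nc[(A, B)][i][j] = math.inf
                    else:
                        t_nc[(A, B)][i][j] = redo + task_time(A, B, i, j, redo)

    # t_opt[B][j] holds T_opt^B(j): T_j is checkpointed and has status B.
    t_opt = {B: [0.0] * (n + 1) for B in STATES}
    for j in range(1, n + 1):
        for B in STATES:
            # Last two lines of Equations (11)/(12): no checkpoint before T_j.
            best = min(input_cost(A) + t_nc[(A, B)][1][j] for A in STATES)
            # First four lines: T_i is the last checkpointed task before T_j.
            for i in range(1, j):
                for status_i in STATES:
                    for A in STATES:
                        best = min(best, t_opt[status_i][i]
                                   + ckpt_cost(status_i, i)
                                   + t_nc[(A, B)][i + 1][j])
            t_opt[B][j] = best
    # Equation (10): add the cost of the final checkpoint after T_n.
    return min(t_opt[B][n] + ckpt_cost(B, n) for B in STATES)
```

The two nested loops over $(i, j)$ with a constant number of status combinations give the quadratic cost discussed above. In an error-free toy instance where replication simply doubles the task time, the plan degenerates to no replication and a single final checkpoint, which is an easy way to check the bookkeeping.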


5 Experiments 

In this section, we evaluate the advantages of adding replication to checkpointing in the presence
of both fail-stop and silent errors. We point out that the simulator that implements the proposed
DP algorithm is publicly available at http://graal.ens-lyon.fr/~yrobert/chainsrep.zip so
that interested readers can instantiate their preferred scenarios and repeat the same simulations for
reproducibility purposes. The code is written in-house in C++ and does not use any library other
than the STL.

We start by assessing scenarios with fail-stop errors only in Section 5.1. We first describe the 
evaluation framework in Section 5.1.1, then we compare checkpoint with replication to checkpoint 
only in Section 5.1.2. In Section 5.1.3, we assess the impact of the different model parameters on 
the performance of the optimal strategy. Finally, Section 5.1.4 compares the performance of the 
optimal solution to alternative sub-optimal solutions. 

Then we assess scenarios with both fail-stop and silent errors in Section 5.2. We first describe 
the few modifications of the evaluation framework in Section 5.2.1, then we compare checkpoint 
with replication to checkpoint only in Section 5.2.2. Finally, Section 5.2.3 assesses the impact of the 
different model parameters on the performance of the optimal strategy. 

5.1 Scenarios with fail-stop errors only 

5.1.1 Experimental setup 

We fix the total work in the chain to W = 10,000 seconds. The choice of this value is less important
than the duration of the tasks compared to the error rate. For this reason, we rely on five different
work distributions, where all tasks are fully parallel ($\alpha_i = 0$):

• Uniform: every task $T_i$ has length $W_i = \frac{W}{n}$, i.e., all tasks are identical.

• Increasing: the length of the tasks constantly increases, i.e., task $T_i$ has length $W_i = i \cdot \frac{2W}{n(n+1)}$.

• Decreasing: the length of the tasks constantly decreases, i.e., task $T_i$ has length $W_i = (n - i + 1) \cdot \frac{2W}{n(n+1)}$.


• HighLow: the chain is formed by long tasks followed by short tasks. The long tasks represent 
60% of the total work and there are |"yg] such tasks. Short tasks represent the remaining 40% 
of the total work and consequently there are n — \ yg] small tasks. 

• Random: task lengths are uniformly chosen at random between $\frac{W}{2n}$ and $\frac{3W}{2n}$. If the total work
of the first $i$ tasks reaches W, the weight of each task is multiplied by $\frac{i}{n}$ so that we can continue
adding the remaining tasks.
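The deterministic distributions can be generated with a few helper functions. The sketch below is illustrative only: the function names are hypothetical, the linear ramp used for Increasing and Decreasing is an assumption chosen so that the lengths sum to W, and the number of long tasks in HighLow is left as a caller-supplied argument `n_long` rather than fixed here. The Random distribution is omitted.

```python
def uniform(n, W):
    """n identical tasks of length W / n."""
    return [W / n] * n


def increasing(n, W):
    """Assumed linear ramp W_i = i * 2W / (n(n+1)), which sums to W."""
    unit = 2.0 * W / (n * (n + 1))
    return [i * unit for i in range(1, n + 1)]


def decreasing(n, W):
    """Mirror image of the increasing ramp."""
    return increasing(n, W)[::-1]


def high_low(n, W, n_long):
    """n_long long tasks share 60% of W; the other tasks share 40%."""
    long_len = 0.6 * W / n_long
    short_len = 0.4 * W / (n - n_long)
    return [long_len] * n_long + [short_len] * (n - n_long)
```

For example, `uniform(20, 10_000.0)` yields twenty tasks of 500 seconds, the setting used in the first experiments.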


Experiments with increasing sequential part ($\alpha_i$) for the tasks are available in the companion
research report [4]. Setting $\alpha_i = 0$ amounts to being in the worst possible case for replication, since
the tasks fully benefit from having twice as many processors when not replicated.

For simplicity, we assume that checkpointing costs are equal to the corresponding recovery
costs, assuming that read and write operations take approximately the same amount of time, i.e.,
$R_{i+1}^{D,norep} = C_i^{norep}$. For replicated tasks, we set $C_i^{rep} = \alpha C_i^{norep}$ and $R_i^{D,rep} = \alpha R_i^{D,norep}$, where
$1 \le \alpha \le 2$, and we assess the impact of parameter $\alpha$ in Section 5.2.3. In the following experiments,
we measure the performance of a solution by evaluating the associated normalized expected
makespan, i.e., the expected execution time needed to compute all the tasks in the chain, with
respect to the execution time without errors, checkpoints, or replicas.


5.1.2 Comparison to checkpoint only 

We start with an analysis of the solutions obtained by running the optimal dynamic programming 
(DP) algorithm ChainsRepCkpt on chains of 20 tasks for the five different work distributions 
described in Section 5.1.1. We also run a variant of ChainsRepCkpt that does not perform any 
replication, hence using a simplified DP algorithm, that is called ChainsCkpt. 

We vary the fail-stop error rate $\lambda^F_{ind} p$ from $10^{-8}$ to $10^{-2}$. Note that when $\lambda^F_{ind} p = 10^{-3}$, we
expect an average of 10 errors per execution of the entire chain (neglecting potential errors during
checkpoints and recoveries). The checkpoint cost $C_i^{norep} = a_i$ is constant per task (hence $b_i = c_i = 0$)
and varies from $10^{-3}\, T_i^{norep}$ to $10^{3}\, T_i^{norep}$. For replicated tasks, we set $\alpha = 1$ in this experiment, i.e.,
$C_i^{rep} = C_i^{norep}$ and $R_i^{D,rep} = R_i^{D,norep}$.

Figure 1 presents the results of these experiments for the Uniform distribution. We are interested
in the number of checkpoints and replicas in the optimal solution. As the optimal solution
may or may not contain checkpoints and replicas, we distinguish 4 cases: None means that no
task is checkpointed or replicated, Checkpointing Only means that some tasks are checkpointed
but no task is replicated, Replication Only means that some tasks are replicated, but no task is
checkpointed, and Checkpointing+Replication means that some tasks are checkpointed and some
tasks are replicated. First, we observe that when the checkpointing cost is less than or equal to the
length of a task (on the left of the black vertical bar), the optimal solution does not use replication, except
when the error rate becomes very high. However, if the checkpointing cost exceeds the length of
one task (on the right of the black vertical bar), replication proves useful in some cases. In particular,
when the fail-stop error rate $\lambda^F_{ind} p$ is medium to high (i.e., $10^{-6}$ to $10^{-4}$), we note that
only replication is used, meaning that no checkpoint is taken and that replication alone is a better
strategy to prevent any error from stopping the application. When the error rate is the highest (i.e.,
$10^{-4}$ or higher), replication is added to the checkpointing strategy to ensure maximum reliability.
It may seem unusual to use replication alone when checkpointing costs increase. This is because 
the recovery cost has to be taken into account as well, in addition to re-executing the tasks that 
have failed. Replication is added to reduce this risk: if successful, there is no recovery cost to pay 
for, nor any task to re-execute. Finally, note that for low error rates and low checkpointing costs,
only checkpoints are used, because their cost is lower than the average re-execution time in case of
error. We point out that similar results are obtained when using other work distributions (see the
extended version [4]).

Figure 1: Impact of checkpoint/recovery cost and error rate on the usage of checkpointing and
replication (x-axis: checkpoint/recovery cost over task length ratio). Total work is fixed to 10,000s
and is distributed uniformly among n = 20 tasks (i.e.,
$T_1 = T_2 = \dots = T_{20} = 500$s). Each color shows the presence of checkpoints and/or replicas in the
optimal solution. Results corresponding to the case highlighted with a red square are presented in
Figure 2.

In the next experiment, we focus on scenarios where both checkpointing and replication are useful,
i.e., we set the checkpointing cost to be twice the length of a task (i.e., $C_i^{norep} = a_i = 2 T_i^{norep}$), and
we set the fail-stop error rate $\lambda^F_{ind} p$ to $10^{-3}$, which corresponds to the case highlighted by the red box
in Figure 1. Figure 2 presents the optimal solutions obtained with the ChainsCkpt and ChainsRepCkpt
algorithms for the Uniform, Increasing, Decreasing, HighLow and Random work
distributions, respectively. First, for the Uniform work distribution, it is clear that the ChainsRepCkpt
strategy leads to a decrease in the number of checkpoints compared to the ChainsCkpt
strategy. Under the ChainsCkpt strategy, a checkpoint is taken every two tasks, while under the
ChainsRepCkpt strategy, a checkpoint is instead taken every three tasks, and two out of three
tasks are also replicated. Then, for the Increasing and Decreasing work distributions, the results
show that most tasks should be replicated, while only the longest tasks are also checkpointed. A
general rule of thumb is that replication only is preferred for short tasks while checkpointing and
replication is reserved for longer tasks, where the probability of error and the re-execution cost are
the highest. Finally, we observe a similar trend for the HighLow work distribution, where two of
the first four longer tasks are checkpointed and replicated.

Figure 3 compares the performance of ChainsRepCkpt to the checkpoint-only strategy ChainsCkpt.
First, we observe that the expected normalized makespan of ChainsCkpt remains almost
constant at ≈ 4.5 for any number of tasks and for any work distribution. Indeed, in our scenario,
checkpoints are expensive and the number of checkpoints that can be used is limited to ≈ 17 in the
optimal solution, as shown in the middle plot. However, the ChainsRepCkpt strategy can take
advantage of the increasing number of shorter tasks by replicating them. In this scenario (high error
rate and high checkpoint cost), this is clearly a winning strategy. The normalized expected makespan
decreases with increasing n, as the corresponding number of tasks that are replicated increases almost
linearly. The ChainsRepCkpt strategy reaches a normalized makespan of ≈ 2.6 for n = 100, i.e.,
a reduction of 35% compared to the normalized expected makespan of the ChainsCkpt strategy.
This is because replicated tasks tend to decrease the global probability of having an error, thus
reducing even more the number of checkpoints needed as seen previously. Regarding the HighLow
work distribution, we observe a higher optimal expected makespan for both the ChainsCkpt and
the ChainsRepCkpt strategies. Indeed, in this scenario, the first tasks are very long (60% of the
total work), which greatly increases the probability of error and the associated re-execution cost.

Figure 2: Optimal solutions obtained with the ChainsCkpt algorithm (top) and the ChainsRepCkpt
algorithm (bottom) for the five work distributions.

Figure 3: Comparison of the ChainsCkpt and ChainsRepCkpt strategies for different numbers
of tasks: impact on the makespan (left), number of checkpoints (middle) and number of replicas
(right) with a fail-stop error rate of $\lambda^F_{ind} p = 10^{-3}$ and a constant checkpointing/recovery cost
$C_i^{norep} = C_i^{rep} = 1000$s.

Figure 4: Impact of fail-stop error rate $\lambda^F_{ind} p$ (left), checkpoint cost (middle), and ratio $\alpha$ between
the checkpointing cost for replicated tasks $C_i^{rep}$ over non-replicated tasks $C_i^{norep}$ (right).


5.1.3 Impact of error rate and checkpoint cost on the performance 

Figure 4 shows the impact of three of the model parameters on the optimal expected normalized 
makespan of both ChainsCkpt and ChainsRepCkpt. First, we show the impact of the fail-stop 
error rate $\lambda^F_{ind} p$ on the performance. The ChainsRepCkpt algorithm improves on the ChainsCkpt
strategy for large values of $\lambda^F_{ind} p$: replication starts to be used for $\lambda^F_{ind} p > 2.6 \times 10^{-4}$ and it reduces
the makespan by ≈ 16% for $\lambda^F_{ind} p = 10^{-3}$ and by up to ≈ 40% when $\lambda^F_{ind} p = 10^{-2}$, where all tasks
are checkpointed and replicated.

Figure 5: Comparison of the ChainsCkpt and ChainsRepCkpt strategies for different numbers
of processors, with different model parameter values for the checkpointing cost ($a_i$, $b_i$, $c_i$).

Then, we investigate the impact of the checkpointing cost with respect to the task length. As 
shown in Figure 1, replication is not needed for low checkpointing costs, i.e., when the checkpointing 
cost is between 0 and 0.8 times the cost of one task: in this scenario, all tasks are checkpointed and 
both strategies lead to the same makespan. When the checkpointing cost is between 0.9 and 1.6 
times the cost of one task, ChainsRepCkpt checkpoints and replicates half of the tasks. Overall,
the ChainsRepCkpt strategy improves the optimal normalized expected makespan by ≈ 11% for
a checkpointing cost ratio of 1.6, and by as much as ≈ 36% when the checkpointing cost is five times
the length of one task.

We now investigate the impact of the ratio $\alpha$ between the checkpointing and recovery cost for
replicated tasks and non-replicated tasks, and we present the results for $\alpha = 1$
($C_i^{rep} = R_i^{D,rep} = C_i^{norep} = R_i^{D,norep}$), $\alpha = 1.5$ ($C_i^{rep} = R_i^{D,rep} = 1.5\, C_i^{norep} = 1.5\, R_i^{D,norep}$) and $\alpha = 2$
($C_i^{rep} = R_i^{D,rep} = 2\, C_i^{norep} = 2\, R_i^{D,norep}$). As expected, the makespan increases with $\alpha$, but it is interesting to note that
the makespan converges towards the same lower bound as the number of (shorter) tasks increases. As
shown previously, when tasks are smaller, ChainsRepCkpt favors replication over checkpointing,
especially when the checkpointing cost is high, which means fewer checkpoints, recoveries and
re-executions.

Finally, we evaluate the efficiency of both strategies when the number of processors increases.
For this experiment, we instantiate the model using variable checkpointing costs, i.e., we do not use
$b_i = c_i = 0$ anymore, so that the checkpointing/recovery cost depends on the number of processors.
We set n = 50, $\lambda^F_{ind} = 10^{-7}$ and we make p vary from 10 to 10,000 (i.e., the global error rate varies
between $10^{-6}$ and $10^{-3}$). Figure 5 presents the results of the experiment using three different sets
of values for $a_i$, $b_i$ and $c_i$. We see that when $b_i$ increases while $c_i$ decreases, replication becomes
useless, even for the larger error rate values. However, when the term $c_i p$ becomes large compared
to $\frac{b_i}{p}$, we see that ChainsRepCkpt is much better than ChainsCkpt, as the checkpointing costs
tend to decrease, in addition to all the other advantages investigated in the previous sections. With
p = 10,000, the three different experiments show an improvement of 80.5%, 40.7% and 0% (from
left to right, respectively).

5.1.4 Impact of the number of checkpoints and replicas 

Figure 6 shows the impact of the number of checkpoints and replicas on the normalized expected 
makespan for different checkpointing costs and fail-stop error rates $\lambda^F_{ind} p$ under the Uniform work
distribution. We show that the optimal solution with ChainsRepCkpt (highlighted by the green 
box) always matches the minimum value obtained in the simulations, i.e., the optimal number of 
checkpoints, number of replicas, and expected execution times are consistent. In addition, we show 
that in scenarios where both the checkpointing cost and the error rate are high, even a small deviation 
from the optimal solution can quickly lead to a large overhead. 

5.2 Scenarios with both fail-stop and silent errors 

In this section we evaluate the power of replication in addition to checkpointing on platforms subject 
to both fail-stop and silent errors. 

5.2.1 Experimental setup 

All the model parameters are instantiated as before, with the following changes to account for the
presence of silent errors. Unless stated otherwise, the fail-stop error rate has been set to 1.28e-3 s$^{-1}$
and the silent error rate has been set to 5.48e-3 s$^{-1}$. The silent error rate has been computed from
real measures [2]: we derived a non-corrected silent error rate per core of 5.48e-9 s$^{-1}$. Similarly, the
fail-stop error rate per core considered was 1.28e-9 s$^{-1}$, which corresponds to a core lifetime of 25
years. Finally, we considered a platform of 1 million cores, which tends to be the trend for current
Top500 machines [42].

Figure 6: Impact of the number of checkpoints and replicas on the normalized expected makespan
for fail-stop error rates of $\lambda^F_{ind} p = 10^{-4}$ (top), $\lambda^F_{ind} p = 10^{-3}$ (middle) and $\lambda^F_{ind} p = 10^{-2}$ (bottom) and
for checkpointing costs of $0.5 \times T_i^{norep}$ (left), $1 \times T_i^{norep}$ (middle) and $2 \times T_i^{norep}$ (right), with
$C_i^{norep} = C_i^{rep}$ under Uniform work distribution. The optimal solution obtained with ChainsRepCkpt
always matches the minimum simulation value and is highlighted by the green box.
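The platform-level rates quoted in the setup of Section 5.2.1 follow from the per-core rates by a simple scaling. The short Python sketch below reproduces that arithmetic, assuming independent cores with exponentially distributed errors so that rates add up; the helper names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def platform_rate(rate_per_core, cores):
    """Aggregate error rate of `cores` independent cores (errors per second)."""
    return rate_per_core * cores


def lifetime_years(rate_per_core):
    """Mean time between errors of a single core, in years."""
    return 1.0 / rate_per_core / SECONDS_PER_YEAR


fail_stop = platform_rate(1.28e-9, 1_000_000)  # platform fail-stop rate
silent = platform_rate(5.48e-9, 1_000_000)     # platform silent error rate
print(fail_stop, silent, lifetime_years(1.28e-9))
```

The computed core lifetime is slightly below 25 years, in line with the rounded figure given in the text, and the platform-level fail-stop rate corresponds to a mean time between failures of roughly 13 minutes.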


As for other parameters, we considered a verification cost of 1% of the corresponding task length. 
The cost of memory recovery was set to 5% of that of a disk recovery, considering an average between 
different measured values from [31]. 

For simplicity, we assume that checkpointing costs are equal to the sum of the corresponding
recovery costs, assuming that read and write operations take approximately the same amount of time,
i.e., $R_{i+1}^{D,norep} + R_{i+1}^{M,norep} = C_i^{norep}$. For replicated tasks, we set $C_i^{rep} = \alpha C_i^{norep}$, $R_i^{D,rep} = \alpha R_i^{D,norep}$
and $R_i^{M,rep} = \alpha R_i^{M,norep}$, where $1 \le \alpha \le 2$, and we assess the impact of parameter $\alpha$ in Section 5.2.3.
As in the previous section, we measure the performance of a solution by evaluating the associated
normalized expected makespan, i.e., the expected execution time needed to compute all the tasks in
the chain, with respect to the execution time without errors, checkpoints, or replicas.
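These settings can be instantiated mechanically. The following Python sketch derives the per-task costs from a task length and a disk-recovery ratio; the function name, argument names and dictionary keys are hypothetical, and the verification cost is only returned for the non-replicated task since the text ties it to the corresponding task length.

```python
def instantiate_costs(task_length, disk_ratio, alpha=1.0):
    """Return (non-replicated costs, replicated costs) for one task.

    disk_ratio is the disk recovery cost expressed as a multiple of the
    task length; alpha scales checkpoint and recovery costs for replicas.
    """
    r_disk = disk_ratio * task_length
    r_mem = 0.05 * r_disk                 # memory recovery: 5% of disk recovery
    norep = {
        "V": 0.01 * task_length,          # verification: 1% of task length
        "R_disk": r_disk,
        "R_mem": r_mem,
        "C": r_disk + r_mem,              # checkpoint: sum of both recoveries
    }
    rep = {key: alpha * norep[key] for key in ("R_disk", "R_mem", "C")}
    return norep, rep
```

With a memory recovery worth 5% of a disk recovery, the checkpoint cost is 1.05 times the disk recovery cost, which is where the factor 1.05 in the cost ranges of the following experiments comes from.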

5.2.2 Comparison to checkpoint only 

We start with an analysis of the solutions obtained by running the optimal dynamic programming 
(DP) algorithm ChainsRepCkpt on chains with 20 tasks for the five different work distributions 
described in Section 5.1.1. We also run a variant of ChainsRepCkpt that does not perform any 
replication, hence using a simplified DP algorithm, that is called ChainsCkpt. 

We vary the fail-stop error rate $\lambda^F_{ind} p$ from $10^{-8}$ to $10^{-2}$, without changing the silent error
rate $\lambda^S_{ind} p$. The disk checkpoint/recovery cost is constant per task and varies from $10^{-3}\, T_i^{norep}$ to
$10^{3}\, T_i^{norep}$ (the memory checkpoint/recovery cost varies from $5 \times 10^{-5}\, T_i^{norep}$ to $50\, T_i^{norep}$).
Overall, all checkpoints have a cost from $1.05 \times 10^{-3}\, T_i^{norep}$ to $1.05 \times 10^{3}\, T_i^{norep}$ as we always perform
both types of checkpoints. For replicated tasks, we set $\alpha = 1$ in this experiment, i.e., $C_i^{rep} = C_i^{norep}$,
$R_i^{D,rep} = R_i^{D,norep}$ and $R_i^{M,rep} = R_i^{M,norep}$. In another experiment, we also make the silent error rate
$\lambda^S_{ind} p$ vary from $10^{-8}$ to $10^{-2}$ without changing the fail-stop error rate of 1.28e-3, with the same
range for the checkpoint cost.

Figure 7: Impact of checkpoint/recovery cost and error rates on the usage of checkpointing and
replication. Total sequential work is fixed to 10,000s and is distributed uniformly among n = 20
tasks (i.e., $T_1 = T_2 = \dots = T_{20} = 500$s). Each color shows the presence of checkpoints and/or
replicas in the optimal solution.

Figure 7 presents the results of these experiments for the Uniform distribution. The colors are 
the same as in Figure 1, with Checkpointing Only meaning that some tasks are checkpointed but 
no task is replicated and Checkpointing+Replication meaning that some tasks are checkpointed and 
some tasks are replicated. The left figure presents the results when the silent error rate is fixed but 
the fail-stop error rate varies. The right figure presents the results of the other experiment with a 
fixed fail-stop error rate and different silent error rates. 

First, we observe that with silent errors, checkpointing becomes mandatory. Too many failures
can strike during the execution, and checkpointing helps reduce the time spent on rollbacks and
re-executions. However, as soon as the cost of a checkpoint exceeds the length of a task, replication
becomes useful and this remains true even for low error rates. This holds for both fail-stop errors
(left) and silent errors (right). There is one exception: when the fail-stop error rate is lower than
$10^{-5}$ and the checkpointing cost is less than twice the length of a task, checkpoints are sufficient
and replication is not needed. Replication is overall not needed under good conditions; however,
for our real setup, indicated by the red box, using both checkpointing and replication is a better
solution. We point out that similar results are obtained when using other work distributions (see
the extended version [4]).

In the next experiment, we focus on scenarios where both checkpointing and replication are
useful, i.e., we set the checkpointing cost to be twice the length of a task (i.e., $C_i^{norep} = a_i =
2 T_i^{norep}$), keeping $\lambda^F_{ind} p = 1.28e-3$ and $\lambda^S_{ind} p = 5.48e-3$, for the fail-stop and silent error rates,
respectively, which corresponds to the case highlighted by the red box in Figure 7. Figure 8 presents
the optimal solutions obtained with the ChainsCkpt and ChainsRepCkpt algorithms for the
Uniform, Increasing, Decreasing, HighLow and Random work distributions, respectively.
With two sources of errors, the solution is straightforward: almost every task must be checkpointed,
with the exception of one (short) task for the Decreasing and Increasing distributions. However,
almost every task is also replicated (20 tasks out of 20 for the Uniform distribution compared
to only 13 in the experiments of Section 5.1.2), showing once more that replication grants better
protection against failures even if it increases the failure-free execution time. Checkpoints are
taken the same way as in our previous experiments: long tasks are systematically checkpointed
while shorter tasks are either unprotected or replicated, as can be seen with the first tasks of the
Increasing distribution and the last task of the Decreasing distribution.

Figure 8: Optimal solutions obtained with the ChainsCkpt algorithm (top) and the ChainsRepCkpt
algorithm (bottom) for the five work distributions.

Figure 9: Comparison of the ChainsCkpt and ChainsRepCkpt strategies for different numbers
of tasks: impact on the makespan (left), number of checkpoints (middle) and number of replicas
(right) with a fail-stop error rate of $\lambda^F_{ind} p = 1.28e-3$, a silent error rate of $\lambda^S_{ind} p = 5.48e-3$ and a
constant checkpointing/recovery cost $C_i^{norep} = C_i^{rep} = 1000$s.

Figure 9 compares the performance of ChainsRepCkpt to the checkpoint-only strategy ChainsCkpt
with fail-stop and silent errors. First, we observe that long tasks, being more likely to fail
than shorter tasks, introduce a high overhead. As a consequence, with 20 tasks, the normalized
makespan is too high and the execution of such applications is not possible, independently of the
work distribution and the chosen checkpointing strategy. With more tasks, however, the ChainsRepCkpt
strategy always yields a shorter makespan compared to the ChainsCkpt strategy. For
example, with 100 tasks, the normalized makespan obtained with the ChainsRepCkpt strategy is
as high as ≈ 8.5 (and much more for the HighLow distribution), compared to ≈ 13 for ChainsCkpt.
Indeed, with such high error rates, all tasks are replicated under the ChainsRepCkpt
strategy, as can be seen on the right plot, but fewer tasks need to be checkpointed (up to 50%
fewer checkpoints with 100 tasks and the Uniform distribution).

The improvement is comparable to the 35% improvement observed with only fail-stop errors. 
Once again, replicated tasks tend to decrease the global probability of having an error, thus slightly 
reducing the number of checkpoints needed, while reducing the re-execution costs that can be very 
important with late-detected silent errors. Regarding the HighLow work distribution, we again 
observe a higher optimal expected makespan for both the ChainsCkpt and the ChainsRepCkpt 
strategies. Indeed, in this scenario, the first tasks are very long (60% of the total work), which greatly 
increases the error probability and the associated re-execution cost. Overall, for such applications
on platforms subject to both fail-stop and silent errors, replication appears to be mandatory and
allows a reduction of the makespan of at least 30% if tasks are not too large (i.e., the probability of
completing the task is not close to 1).

5.2.3 Impact of error rate and checkpoint cost on the performance 

Figure 10 shows the impact of four of the model parameters on the optimal expected normalized
makespan of both ChainsCkpt and ChainsRepCkpt, using the Uniform distribution. First,
we show the impact of the fail-stop error rate $\lambda^F_{ind} p$ on the performance. The ChainsRepCkpt
strategy always yields shorter makespans compared to the ChainsCkpt strategy. All tasks are
always replicated, reducing the probability of having an error for each task, and each task is also
checkpointed. The normalized makespan for ChainsCkpt is 19.5 for $\lambda^F_{ind} p = 10^{-5}$, compared to
19.2 for ChainsRepCkpt, i.e., a reduction of only 1.7%, but this goes up to 50.3 for $\lambda^F_{ind} p = 1.14e-3$
compared to 35.2 when using replication, i.e., a reduction of 30%. The results are similar when we
vary the silent error rate: when $\lambda^S_{ind} p = 10^{-5}$, ChainsRepCkpt results in a normalized makespan
of 4.60 compared to 5.55 with ChainsCkpt, i.e., a reduction of 17%, and this goes up to more than
30% when $\lambda^S_{ind} p > 5 \times 10^{-3}$.

Then, we investigate the impact of the checkpointing cost with respect to the task length. The
results are slightly different now that we have silent errors: ChainsCkpt and ChainsRepCkpt
behave similarly only for small values of checkpoint cost. ChainsRepCkpt becomes better than
ChainsCkpt for a checkpointing cost ratio above 0.525, thus reducing the makespan obtained using only checkpoints. Both
strategies yield a makespan that increases linearly with the checkpointing cost; however, the ChainsRepCkpt
strategy needs fewer checkpoints, and its makespan increases more slowly. This means that the
costlier the checkpoints, the better the improvement thanks to replication. Overall, the execution
under the ChainsRepCkpt strategy is 1.17 times faster than ChainsCkpt for a checkpointing cost
of $1.05\, T_i^{norep}$, 1.66 times faster for a checkpoint cost of $3.15\, T_i^{norep}$, and this goes up to 1.95 times
faster when the checkpointing cost is $5.25\, T_i^{norep}$.

Figure 10: Impact of fail-stop error rate $\lambda^F_{ind} p$ (left), checkpoint cost (middle, x-axis: checkpoint/recovery
cost over task length ratio) and ratio $\alpha$ between the
checkpointing cost for replicated tasks $C_i^{rep}$ over non-replicated tasks $C_i^{norep}$ (right) for the Uniform
distribution.

Figure 11: Comparison of the ChainsCkpt and ChainsRepCkpt strategies for different numbers
of processors, with different model parameter values for the checkpointing cost ($a_i$, $b_i$, $c_i$).

We now investigate the impact of the ratio $\alpha$ between the checkpointing and recovery cost for
replicated tasks and non-replicated tasks, and we present the results for $\alpha = 1$, $\alpha = 1.5$ and $\alpha = 2$. As
expected, the makespan increases with $\alpha$, but it is interesting to note that the makespan converges
towards the same lower bound as the number of (shorter) tasks increases. As shown previously,
when tasks are smaller, ChainsRepCkpt favors replication over checkpointing, especially when the
checkpointing cost is high, which means fewer checkpoints, recoveries and re-executions.

Finally, we evaluate the efficiency of both strategies when the number of processors increases. 
For this experiment, we instantiate the model using variable checkpointing costs, i.e., we do not use 
$b_i = c_i = 0$ anymore, so that the checkpointing/recovery cost depends on the number of processors.
We set $n = 50$, $\lambda^{F}_{ind} = 1.28 \times 10^{-9}$, $\lambda^{S}_{ind} = 5.48 \times 10^{-9}$, and we vary $p$ from 1,000 to 2,000,000
(i.e., the error rates vary between $10^{-6}$ and $10^{-2}$ approximately). Figure 11 presents the results
of the experiment using three different sets of values for $a_i$, $b_i$ and $c_i$. The trend is the same as
previously with fail-stop errors: when $b_i$ increases and $c_i$ decreases, the advantage of using replication
becomes less clear. However, on every plot, ChainsCkpt and ChainsRepCkpt yield the same
makespan only when few cores are used. Every plot shows that, with the increasing number of cores
on today's platforms, ChainsRepCkpt will perform increasingly better than ChainsCkpt.
In particular, the improvement for each set of parameters (from left to right) is 69%, 30% and 0%
for p = 500,000, and 76%, 60% and 16% for p = 1,500,000.
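
The range of platform error rates quoted above can be checked with a short computation. The sketch below assumes that errors strike processors independently, so that the platform-wide rate is the individual rate multiplied by the number of processors; the function name `platform_rate` is hypothetical.

```python
# Individual (per-processor) error rates quoted in the text.
individual_rates = (1.28e-9, 5.48e-9)
# Smallest and largest processor counts used in the experiment.
p_min, p_max = 1_000, 2_000_000


def platform_rate(individual_rate: float, processors: int) -> float:
    """Platform-wide error rate, assuming independent per-processor errors."""
    return individual_rate * processors


lowest = min(platform_rate(r, p_min) for r in individual_rates)
highest = max(platform_rate(r, p_max) for r in individual_rates)
print(f"platform error rates range from {lowest:.2e} to {highest:.2e}")
```

Under this assumption, the smaller rate on 1,000 processors gives about $10^{-6}$ and the larger rate on 2,000,000 processors gives about $10^{-2}$, which matches the approximate bounds stated in the text.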


6 Related work 

In this section, we discuss the work related to checkpointing and replication. Each of these
mechanisms has been studied for coping with fail-stop errors and/or with silent errors. The present work
combines checkpointing and replication for linear workflows in the presence of fail-stop and silent 
errors. 

6.1 Checkpointing 

The de facto general-purpose recovery technique in high-performance computing is checkpointing
and rollback recovery [13, 20]. Checkpointing policies have been widely studied and we refer to [25] 
for a survey of various protocols. 

For divisible load applications where checkpoints can be inserted at any point in the execution 
for a nominal cost C, there exist well-known formulas proposed by Young [46] and Daly [16] to 
determine the optimal checkpointing period. For applications expressed as linear workflows, such as
those considered in the present work, the problem of finding the optimal checkpointing strategy, i.e., of
determining which tasks to checkpoint in order to minimize the expected execution time, has been solved
by Toueg and Babaoglu [43].
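
To make the two settings above concrete, the sketch below first evaluates Young's first-order period, $\sqrt{2C\mu}$ for a checkpoint cost $C$ and a platform MTBF $\mu$, and then places checkpoints along a chain with a quadratic dynamic program. The dynamic program is written in the spirit of the chain problem and is not a transcription of the algorithm of [43]. It assumes exponentially distributed fail-stop errors only, no downtime, a constant checkpoint and recovery cost, a recovery after every failure, and a mandatory checkpoint after the last task. All function and parameter names are ours.

```python
import math
from itertools import accumulate


def young_period(ckpt_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpointing period."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def expected_segment_time(work: float, ckpt: float, rec: float, lam: float) -> float:
    """Expected time to run `work` followed by a checkpoint, with exponential
    fail-stop errors of rate `lam` and a recovery of cost `rec` after each
    failure (assumption: no downtime)."""
    return math.exp(lam * rec) * (math.exp(lam * (work + ckpt)) - 1.0) / lam


def place_checkpoints(weights, ckpt, rec, lam):
    """Return (expected makespan, 1-based indices of checkpointed tasks)."""
    n = len(weights)
    prefix = [0.0] + list(accumulate(weights))
    best = [0.0] + [math.inf] * n   # best[i]: tasks 1..i done, task i checkpointed
    prev = [0] * (n + 1)            # prev[i]: previous checkpointed task (0 = start)
    for i in range(1, n + 1):
        for j in range(i):
            segment = expected_segment_time(prefix[i] - prefix[j], ckpt, rec, lam)
            if best[j] + segment < best[i]:
                best[i], prev[i] = best[j] + segment, j
    # Walk back through the predecessors to recover the checkpoint positions.
    positions, i = [], n
    while i > 0:
        positions.append(i)
        i = prev[i]
    return best[n], sorted(positions)


makespan, positions = place_checkpoints([10.0] * 5, ckpt=0.01, rec=0.01, lam=0.1)
print(makespan, positions)
```

With cheap checkpoints and a high error rate, as in the call above, every task ends up checkpointed; with rare errors and costly checkpoints, only the mandatory final checkpoint remains.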

Single-level checkpointing schemes suffer from the intrinsic limitation that the cost of
checkpointing and recovery grows with the error probability, and becomes unsustainable at large scale [23, 8]
(even with diskless or incremental checkpointing [34]). Recent advances in decreasing the cost of
checkpointing include multi-level checkpointing approaches, or the use of SSD or NVRAM as
secondary storage [11]. To reduce the I/O overhead, various two-level checkpointing protocols have
been studied. Vaidya [44] proposed a two-level recovery scheme that tolerates a single node error 
using a local checkpoint stored on a partner node. If more than one error occurs during any local 
checkpointing interval, the scheme resorts to the global checkpoint. Silva and Silva [37] advocated 
for a similar scheme by using memory protected by XOR encoding to store local checkpoints. Di 
et al. [17] analyzed a two-level computational pattern, and proved that equal-length checkpointing 
segments constitute the optimal solution. Benoit et al. [7] relied on disk checkpoints to cope with 
fail-stop errors and used memory checkpoints coupled with error detectors to handle silent data
corruptions. They derived first-order approximation formulas for the optimal pattern length as well 
as the number of memory checkpoints between two disk checkpoints. The present work employs 
single-level checkpointing (in memory or on stable storage) for individual tasks in linear workflows. 


6.2 Replication 

As mentioned earlier, this work only considers task duplication. Triplication [29] (three replicas 
per task) is also possible yet only useful with extremely high error rates, which are unlikely in 
HPC systems. The use of redundant MPI processes is analyzed in [12, 22, 23]. In particular, 
Ferreira et al. [23] studied the use of process replication for MPI applications, using two replicas 
per MPI process. They provide a theoretical analysis of parallel efficiency, an MPI implementation 
that supports transparent process replication (including error detection, consistent message ordering 
among replicas, etc.), and a set of experimental and simulation results. Thread-level replication has 
also been investigated [47, 14, 35]. The present work targets selective task replication, as opposed
to full task replication, in conjunction with selective task checkpointing, in order to cope with
fail-stop and silent errors while minimizing the makespan.

Partial redundancy was also studied (in combination with coordinated checkpointing) to decrease 
the overhead of full replication [18, 38, 39]. Adaptive redundancy is introduced in [24], where a subset 
of processes is dynamically selected for replication. Earlier work [3] considered replication in the 
context of divisible load applications. In the present work, task replication (including work and data) 
is studied in the context of linear workflows, which represent a harder case than that of divisible 
load applications as tasks cannot arbitrarily be divided and are executed non-preemptively. 

Ni et al. [32] introduced process duplication to cope with both fail-stop and silent errors. Their
pioneering paper contains many interesting results. It differs from this work in that they only
consider perfectly parallel applications, while we investigate herein per-task speedup profiles that
obey Amdahl’s law. More recently, Subasi et al. [40] proposed a software-based selective replication
of task-parallel applications with both fail-stop and silent errors. In contrast, the present work
(i) considers dependent tasks such as those found in applications consisting of linear workflows; and (ii)
proposes an optimal dynamic programming algorithm to solve the combined selective replication and
checkpointing problem. Combining replication with checkpointing has also been proposed in [36, 49,
21] for HPC platforms, and in [27, 45] for grid computing.
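
As a reminder of the speedup profile mentioned above, Amdahl's law bounds the speedup of a task on $p$ processors by the fraction of its work that is inherently sequential. The sketch below is a generic illustration; the parameter name `seq_fraction` is ours and is unrelated to the ratio between checkpointing costs used in the simulations.

```python
def amdahl_speedup(seq_fraction: float, processors: int) -> float:
    """Speedup on `processors` processors when a fraction `seq_fraction`
    of the work cannot be parallelized (Amdahl's law)."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / processors)


def parallel_time(sequential_time: float, seq_fraction: float, processors: int) -> float:
    """Execution time on `processors` processors under Amdahl's law."""
    return sequential_time / amdahl_speedup(seq_fraction, processors)


# A perfectly parallel task (seq_fraction = 0) scales linearly,
# whereas a 10% sequential fraction caps the speedup below 10.
for p in (10, 100, 1000):
    print(p, amdahl_speedup(0.0, p), round(amdahl_speedup(0.1, p), 2))
```

A perfectly parallel application, as considered in [32], corresponds to the special case of a zero sequential fraction in this sketch.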


7 Conclusion 

In this work, we studied the combination of checkpointing and replication to minimize the execution 
time of linear workflows in environments prone to both fail-stop and silent errors. We introduced 
a sophisticated dynamic programming algorithm that solves the combined problem optimally, by 
determining which tasks to checkpoint and which tasks to replicate, in order to minimize the total 
execution time. This dynamic programming algorithm was validated through extensive simulations 
that reveal the conditions in which checkpointing, replication, or both lead to improved performance. 
We have observed that the gain over the checkpoint-only approach is quite significant, in particular 
when checkpointing is costly and error rates are high. 

Future work will address workflows whose dependence graphs are more complex than linear chains 
of tasks. Although an optimal solution seems hard to reach, the design of efficient heuristics that
decide where to locate checkpoints and when to use replication would prove highly beneficial for
the efficient and reliable execution of HPC applications on current and future large-scale platforms.